# Lab 25 - k-Nearest Neighbors classifier 1

We will continue using the Titanic training and test data from [Kaggle](https://www.kaggle.com/c/titanic) from Lab 24.

First, we need to install a new library, called [scikit-learn](https://scikit-learn.org/stable/).

In [None]:
!pip install --user scikit-learn

Next import the necessary libraries.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
%matplotlib inline

### Loading and cleaning the data

Read the CSV file `train.csv` into the dataframe `train`.

Display your `train` dataframe below to check it was created properly.

For reference, let's get the summaries of all quantitative columns:

And the summaries of all qualitative columns:

As in Lab 24, we can see there is some missing data in the `Age`, `Cabin`, and `Embarked` columns. With previous datasets, we have simply removed any rows with missing data. Today we will take a different approach, and replace the missing `Age` and `Embarked` data with the most likely value. We won't use the `Cabin` column in the classification, so we don't worry about the missing data in it.

First, let's replace any missing ages with the median age. Compute the median age, and store it in the variable `median_age`.

To fill the NaN values in the `Age` column, type and run this code below: `train["Age"] = train["Age"].fillna(median_age)`
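
As a rough illustration of what `fillna()` does, here is a minimal sketch on a made-up column of ages (the numbers and the names `ages`, `toy_median`, and `filled` are invented for this example and are not part of the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (made-up numbers, not Titanic data)
ages = pd.Series([20.0, 30.0, np.nan, 40.0, 70.0, np.nan], name="Age")

toy_median = ages.median()        # NaN values are skipped when computing the median
filled = ages.fillna(toy_median)  # returns a new Series; `ages` itself is unchanged

print("median used for filling:", toy_median)
print("missing before:", ages.isna().sum(), "missing after:", filled.isna().sum())
print("mean before:", ages.mean(), "mean after:", filled.mean())
print("median after:", filled.median())
```

Comparing the printed mean and median before and after filling may help when you answer the questions about the real `Age` column.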

To check this worked, display the quantitative column summaries again:

Did the mean age change? Did the median age change? Does this make sense?

### First prediction

Let's use the quantitative columns `Age`, `Fare`, `SibSp`, and `Parch` to make a prediction.

First, make a dataframe `X` containing only those columns.

Set the variable `y` to be the `Survived` column.

Now we will split our (training) data into a training set and a test set. We do this so we can easily check our predictions without submitting to Kaggle (which takes some time).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

Display the variables `X_train`, `X_test`, `y_train`, `y_test`. What is stored in them?
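
For comparison, here is a small sketch of `train_test_split` on a made-up dataset of 20 rows. The names `toy_X`, `toy_y`, and the `t`-prefixed variables are invented for this example, and passing `random_state=0` is an optional choice that makes the shuffle reproducible:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, two feature columns, and a 0/1 label
toy_X = pd.DataFrame({"a": np.arange(20), "b": np.arange(20) * 2})
toy_y = pd.Series(np.arange(20) % 2, name="label")

# random_state fixes the shuffle so the same split is produced every run
tX_train, tX_test, ty_train, ty_test = train_test_split(toy_X, toy_y, random_state=0)

print("training rows:", tX_train.shape[0], "test rows:", tX_test.shape[0])
# The labels keep the same row index as the features they belong to
print("indexes line up:", ty_train.index.equals(tX_train.index))
```

With the default settings, about a quarter of the rows are held out for testing.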

Now we will create a variable to store information about our k-nearest neighbors classifier:

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)

How many neighbors is the classifier using?

Next we will *fit* our classifier using the training data. This means the classifier stores the coordinates of each training data point, along with whether that passenger survived, in a way that will make the next step easy to compute.

In [None]:
knn.fit(X_train, y_train)

Next, we will make a prediction about the test data.

In [None]:
y_pred = knn.predict(X_test)
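
To make the idea behind the prediction concrete, here is a simplified sketch of a k-nearest neighbors vote written by hand. It assumes plain Euclidean distance and a simple majority vote; the function `knn_predict_one` and the toy points are invented for illustration, and scikit-learn's actual implementation is more efficient and handles ties in its own way:

```python
from collections import Counter

import numpy as np


def knn_predict_one(train_points, train_labels, query, k=3):
    """Predict the label of one query point by majority vote of its k nearest neighbors."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_points - query, axis=1)
    # Positions of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one
    votes = Counter(train_labels[nearest].tolist())
    return votes.most_common(1)[0][0]


# Made-up training data: three points labeled 0 and two points labeled 1
toy_points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.5]])
toy_labels = np.array([0, 0, 0, 1, 1])

print(knn_predict_one(toy_points, toy_labels, np.array([1.2, 1.4])))
print(knn_predict_one(toy_points, toy_labels, np.array([8.5, 8.0])))
```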

Display `y_pred`. Is it what you expected?

Finally, let's compute the accuracy of our predictions.

In [None]:
knn.score(X_test, y_test)
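
Accuracy here means the fraction of test passengers whose predicted label matches the true label. As a sketch with made-up labels (the arrays `toy_true` and `toy_pred` are invented for this example), it can be computed by hand or with scikit-learn's `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions for eight passengers
toy_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
toy_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Fraction of positions where prediction and truth agree
accuracy = np.mean(toy_true == toy_pred)
print("by hand:", accuracy)

# The same quantity using the library helper
print("accuracy_score:", accuracy_score(toy_true, toy_pred))
```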

How accurate was our model? We can make it more accurate by adding the qualitative information about the passengers.

### Adding the qualitative column information

Next, we'll replace the missing value in the `Embarked` column with the mode of the column. First compute and display the mode.

What's the mode? How is it stored: as a string or as a Series?

Since the mode is stored as a Series, the easiest way to replace the missing values is to pass `"S"` directly as the parameter for `fillna()`.
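
An alternative to typing the value by hand is to pull the first entry out of the Series that `mode()` returns. The sketch below shows this on a made-up column (the names `ports`, `toy_mode`, `most_common`, and `filled_ports` are invented for this example):

```python
import pandas as pd

# Made-up column of ports with one missing value
ports = pd.Series(["S", "C", "S", None, "Q", "S"], name="Port")

toy_mode = ports.mode()           # a Series, because a column can have several modes
print(type(toy_mode).__name__)

most_common = toy_mode.iloc[0]    # take the first (here, the only) mode as a plain value
filled_ports = ports.fillna(most_common)
print(filled_ports.tolist())
```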

Replace the missing values in the Embarked column with `S`:

Check that your code worked by displaying the summary of the qualitative columns:

### Optional: Effect of Sex, Pclass, and Embarked on survival

Let's look at the effect of Sex, Pclass, and Embarked on survival. Create two new dataframes, one containing only survivors and one containing only people who perished.

Create a bar chart of sex of the survivors.

Who was more likely to survive?

Now create a bar chart of the people who did not survive.

Who was more likely to not survive?

Let's look at passenger class (`Pclass`). Create a bar chart of the passenger class of the survivors.

Which class of passengers was most likely to survive?

Now create a bar chart of the passenger class of the passengers who did not survive.

Which class of passengers was most likely to not survive?

Finally, create a bar chart of the ports the surviving passengers embarked at and a bar chart of the ports the passengers who did not survive embarked at.

Is there a difference in the distributions of embarkation ports for the surviving and non-surviving passengers?

### Creating dummy variables

The k-nearest neighbors classifier can only use quantitative data. However, we saw that sex and passenger class (`Pclass`) played a large role in whether someone survived, and the port of embarkation also played a (lesser) role. To use these qualitative columns, we can convert them into quantitative data using *dummy variables*. A *dummy* or *indicator* variable is a variable that takes the value 0 or 1 depending on whether that data point is in some category.

Run the following code to create dummy variables for these columns and display the new dataframe:

In [None]:
train2 = pd.get_dummies(train, columns = ["Pclass","Sex","Embarked"], drop_first = True)
train2.head()

What happened?

Each of the qualitative columns was replaced by one or two dummy variable columns. For example, `Sex` was replaced by `Sex_male`, which contains 1 if the passenger was male and 0 if the passenger was female (recent versions of pandas display these values as `True` and `False`).
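
To see the effect of the `drop_first` parameter in isolation, here is a sketch on a made-up dataframe. The `Deck` column and its values are invented for this example, and `dtype=int` is passed so the dummies print as 0 and 1 regardless of pandas version:

```python
import pandas as pd

# Made-up data with one qualitative column that has three categories
toy = pd.DataFrame({"Deck": ["A", "B", "C", "A"], "Fare": [10.0, 20.0, 30.0, 40.0]})

# One dummy column per category
all_dummies = pd.get_dummies(toy, columns=["Deck"], dtype=int)
# drop_first=True leaves out the dummy for the first category
fewer_dummies = pd.get_dummies(toy, columns=["Deck"], drop_first=True, dtype=int)

print(list(all_dummies.columns))
print(list(fewer_dummies.columns))
print(fewer_dummies)
```

Look at the rows where every remaining dummy is 0 and think about which category they must belong to.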

Why is there no `Sex_female` column?

We are almost ready to do our classification. We just need to drop all remaining qualitative columns.

In [None]:
# Drop the remaining qualitative columns in a single call
train2 = train2.drop(columns=["Cabin", "Name", "Ticket"])

Here is another way to make our `X` dataframe:

In [None]:
X2 = train2.drop("Survived",axis=1)

Next we split our data into training and test sets. Use `X2` and `y` from above.

Let's run the classifier. First create it.

Next fit the training data.

Next make a prediction about the test data.

Finally compute the accuracy of this model.

Did the accuracy improve? What happens if you change the number of neighbors?

What happens if you use fewer columns?
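
One possible way to explore the number of neighbors is to loop over several values and record the accuracy of each. The sketch below does this on synthetic data generated with `make_classification` rather than the Titanic data; the variable names and the list of `k` values are arbitrary choices for this example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for a real dataset
demo_X, demo_y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)
dX_train, dX_test, dy_train, dy_test = train_test_split(demo_X, demo_y, random_state=0)

# Fit one classifier per value of k and store its test accuracy
scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(dX_train, dy_train)
    scores[k] = model.score(dX_test, dy_test)

for k, acc in scores.items():
    print(f"k={k:2d}  accuracy={acc:.3f}")
```

The same loop could be pointed at your own training and test sets to compare different values of `n_neighbors`.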